In [1]:

    
# Allow us to load `open_cp` without installing
import sys, os.path
sys.path.insert(0, os.path.abspath(os.path.join("..", "..")))

Data

The data can be downloaded from https://catalog.data.gov/dataset/crimes-2001-to-present-398a4 (see the module docstring of open_cp.sources.chicago See also https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2

The total data sets (for all crime events 2001 onwards) give different files between these two sources. We check that they do contain the same data.



In [2]:

    
import sys, os, csv, lzma
import open_cp.sources.chicago as chicago

filename = os.path.join("..", "..", "open_cp", "sources", "chicago.csv")
filename1 = os.path.join("..", "..", "open_cp", "sources", "chicago1.csv")
filename_all = os.path.join("..", "..", "open_cp", "sources", "chicago_all.csv.xz")
filename_all1 = os.path.join("..", "..", "open_cp", "sources", "chicago_all1.csv.xz")

Check the total data sets agree

The files filename and filename1 were downloaded from, respectively, the US Gov website, and the Chicago site. They are slightly different in size, but appear to contain the same data. (This can be checked!)

The files filename_all and filename_all1 were also downloaded from, respectively, the US Gov website, and the Chicago site. While they are the same size (uncompressed), and have the same headers, the data appears, at least naively, to be different.



In [3]:

    
with lzma.open(filename_all, "rt") as file:
    print(next(file))
with lzma.open(filename_all1, "rt") as file:
    print(next(file))









    



ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location



In [4]:

    
with lzma.open(filename_all, "rt") as file:
    next(file); print(next(file))
with lzma.open(filename_all1, "rt") as file:
    next(file); print(next(file))









    



4652043,HL594701,09/06/2005 12:06:44 PM,004XX E 61ST ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,OTHER,true,false,0313,003,20,42,18,1180151,1864661,2005,04/15/2016 08:55:02 AM,41.783897141,-87.61504023,"(41.783897141, -87.61504023)"

7806704,HS617397,11/15/2010 07:05:00 PM,040XX W WILCOX ST,2024,NARCOTICS,POSS: HEROIN(WHITE),SIDEWALK,true,false,1115,011,28,26,18,1149516,1899031,2010,02/04/2016 06:33:39 AM,41.87886043,-87.726470068,"(41.87886043, -87.726470068)"

Compare the actual contents of the files.

This is rather memory intensive, so we go to a little effort to use less RAM.



In [5]:

    
# NB: These methods encode a missing geometry and (-1, -1)

def yield_tuples(f):
    for feature in chicago.generate_GeoJSON_Features(f, type="all"):
        props = feature["properties"]
        if props["crime"] == "HOMICIDE":
            continue
        coords = feature["geometry"]
        if coords is None:
            coords = (-1, -1)
        else:
            coords = coords["coordinates"]
        event = (props["case"], props["crime"], props["type"], props["location"],
                 props["timestamp"], props["address"], coords[0], coords[1])
        yield event

def load_as_tuples(f):
    events = list(yield_tuples(f))

def load_as_dict_to_lists(f):
    events = dict()
    for event in yield_tuples(f):
        case = event[0]
        if case not in events:
            events[case] = []
        events[case].append(event[1:])
    return events



In [6]:

    
def compare_one_other(file1, file2):
    in_only1 = []
    in_only2 = []
    with lzma.open(file1, "rt") as file:
        events = load_as_dict_to_lists(file)
    
    with lzma.open(file2, "rt") as file:
        for event in yield_tuples(file):
            case, e = event[0], event[1:]
            if case not in events or e not in events[case]:
                in_only2.append(event)
                continue
            events[case].remove(e)
            if len(events[case]) == 0:
                del events[case]
                
    for case, e in events.items():
        in_only1.append( (case,) + e )
    
    return in_only1, in_only2



In [7]:

    
compare_one_other(filename_all, filename_all1)









    Out[7]:





([], [])

Check that the data is encoded in the expected way

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location



In [38]:

    
import pyproj, numpy
proj = pyproj.Proj({'init': 'epsg:3435'}, preserve_units=True)

def check_file(file):
    reader = csv.reader(file)
    header = next(reader)
    assert header[15] == "X Coordinate"
    assert header[16] == "Y Coordinate"
    assert header[19] == "Latitude"
    assert header[20] == "Longitude"
    assert header[21] == "Location"
    
    for row in reader:
        x, y = row[15], row[16]
        lat, lon, latlon = row[19], row[20], row[21]
        if x == "":
            assert y == ""
            assert lat == ""
            assert lon == ""
            assert latlon == ""
        else:
            assert latlon == "(" + lat + ", " + lon + ")"
            xx, yy = proj(float(lon), float(lat))
            assert int(x) == numpy.round(xx)
            assert int(y) == numpy.round(yy)



In [39]:

    
with lzma.open(filename_all, "rt") as file:
    check_file(file)



In [40]:

    
with lzma.open(filename_all1, "rt") as file:
    check_file(file)

Compare the full dataset with the extract

Let us compare the last 12 months data with the full dataset.

There are a few differences, but they really are "few" compared to the size of the complete dataset. There appears to be no pattern in the differences.



In [9]:

    
with lzma.open(filename_all, "rt") as file:
    all_events = load_as_dict_to_lists(file)



In [10]:

    
frame = chicago.load_to_geoDataFrame()
frame.head()









    Out[10]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      0
      010XX N CENTRAL PARK AVE
      HZ560767
      OTHER OFFENSE
      POINT (-87.71645415899999 41.899712716)
      APARTMENT
      2016-12-22T02:55:00
      VIOLATE ORDER OF PROTECTION
    
    
      1
      051XX S WASHTENAW AVE
      HZ561134
      BATTERY
      POINT (-87.691539994 41.800445234)
      RESIDENTIAL YARD (FRONT/BACK)
      2016-12-22T11:17:00
      AGGRAVATED: OTHER FIREARM
    
    
      2
      059XX W DIVERSEY AVE
      HZ565584
      DECEPTIVE PRACTICE
      POINT (-87.774165121 41.931166274)
      RESIDENCE
      2016-12-09T12:00:00
      FINANCIAL IDENTITY THEFT $300 AND UNDER
    
    
      3
      001XX N STATE ST
      HZ561772
      THEFT
      POINT (-87.62787669799999 41.883500187)
      DEPARTMENT STORE
      2016-12-22T18:50:00
      RETAIL THEFT
    
    
      4
      008XX N MICHIGAN AVE
      HZ561969
      THEFT
      POINT (-87.624095634 41.897982937)
      SMALL RETAIL STORE
      2016-12-22T19:20:00
      RETAIL THEFT



In [11]:

    
known_diffs = {"JA233208", "JA228951", "JA249656", "JA256373", "JA256594", "JA256838"}

not_found = []

for index, row in frame.iterrows():
    if row.crime == "HOMICIDE":
        continue
    if row.case in known_diffs:
        continue
    if row.case not in all_events:
        not_found.append(row.case)
        continue
    event = all_events[row.case]
    if len(event) > 1:
        print("Doubled, skipping:", row.case)
        continue
    event = event[0]
    assert(row.address == event[4])
    assert(row.crime == event[0])
    assert(row.location == event[2])
    assert(row.timestamp == event[3])
    assert(row.type == event[1])
    if row.geometry is not None:
        assert(row.geometry.coords[0][0] == event[5])
        assert(row.geometry.coords[0][1] == event[6])



In [12]:

    
not_found









    Out[12]:





[]



In [14]:

    
frame[frame.case.map(lambda x : x in known_diffs)]









    Out[14]:






  
    
      
      address
      case
      crime
      geometry
      location
      timestamp
      type
    
  
  
    
      36555
      051XX N DAMEN AVE
      JA233208
      KIDNAPPING
      POINT (-87.679374655 41.975099794)
      SIDEWALK
      2017-04-21T07:25:00
      UNLAWFUL RESTRAINT
    
    
      44245
      086XX S DORCHESTER AVE
      JA228951
      MOTOR VEHICLE THEFT
      POINT (-87.590285761 41.737308941)
      GAS STATION
      2017-04-17T21:45:00
      AUTOMOBILE
    
    
      45752
      051XX N KILDARE AVE
      JA249656
      BURGLARY
      POINT (-87.735603868 41.974363533)
      RESIDENCE-GARAGE
      2017-05-03T09:00:00
      FORCIBLE ENTRY
    
    
      45776
      010XX W 14TH ST
      JA256373
      MOTOR VEHICLE THEFT
      None
      RESIDENCE
      2017-04-25T17:00:00
      AUTOMOBILE
    
    
      45903
      071XX S CORNELL AVE
      JA256594
      DECEPTIVE PRACTICE
      None
      APARTMENT
      2016-06-10T06:00:00
      FINANCIAL IDENTITY THEFT OVER $ 300
    
    
      46028
      029XX N SHEFFIELD AVE
      JA256838
      DECEPTIVE PRACTICE
      None
      RESTAURANT
      2017-04-06T10:00:00
      FINANCIAL IDENTITY THEFT OVER $ 300



In [ ]:

	address	case	crime	geometry	location	timestamp	type
0	010XX N CENTRAL PARK AVE	HZ560767	OTHER OFFENSE	POINT (-87.71645415899999 41.899712716)	APARTMENT	2016-12-22T02:55:00	VIOLATE ORDER OF PROTECTION
1	051XX S WASHTENAW AVE	HZ561134	BATTERY	POINT (-87.691539994 41.800445234)	RESIDENTIAL YARD (FRONT/BACK)	2016-12-22T11:17:00	AGGRAVATED: OTHER FIREARM
2	059XX W DIVERSEY AVE	HZ565584	DECEPTIVE PRACTICE	POINT (-87.774165121 41.931166274)	RESIDENCE	2016-12-09T12:00:00	FINANCIAL IDENTITY THEFT $300 AND UNDER
3	001XX N STATE ST	HZ561772	THEFT	POINT (-87.62787669799999 41.883500187)	DEPARTMENT STORE	2016-12-22T18:50:00	RETAIL THEFT
4	008XX N MICHIGAN AVE	HZ561969	THEFT	POINT (-87.624095634 41.897982937)	SMALL RETAIL STORE	2016-12-22T19:20:00	RETAIL THEFT

	address	case	crime	geometry	location	timestamp	type
36555	051XX N DAMEN AVE	JA233208	KIDNAPPING	POINT (-87.679374655 41.975099794)	SIDEWALK	2017-04-21T07:25:00	UNLAWFUL RESTRAINT
44245	086XX S DORCHESTER AVE	JA228951	MOTOR VEHICLE THEFT	POINT (-87.590285761 41.737308941)	GAS STATION	2017-04-17T21:45:00	AUTOMOBILE
45752	051XX N KILDARE AVE	JA249656	BURGLARY	POINT (-87.735603868 41.974363533)	RESIDENCE-GARAGE	2017-05-03T09:00:00	FORCIBLE ENTRY
45776	010XX W 14TH ST	JA256373	MOTOR VEHICLE THEFT	None	RESIDENCE	2017-04-25T17:00:00	AUTOMOBILE
45903	071XX S CORNELL AVE	JA256594	DECEPTIVE PRACTICE	None	APARTMENT	2016-06-10T06:00:00	FINANCIAL IDENTITY THEFT OVER $ 300
46028	029XX N SHEFFIELD AVE	JA256838	DECEPTIVE PRACTICE	None	RESTAURANT	2017-04-06T10:00:00	FINANCIAL IDENTITY THEFT OVER $ 300